Author: Sultan Albogami
Last Updated: 3/31/2020
Description: Initial investigation of state- and county-level COVID-19 data to discover patterns, spot anomalies, test hypotheses, and check assumptions using summary statistics and graphical representations.
Importing Libraries
import os
# !pip install numpy  (run the install lines only the first time)
import numpy as np
# !pip install pandas
import pandas as pd
# !pip install matplotlib
import matplotlib.pyplot as plt
from matplotlib.pyplot import figure
from matplotlib import style
style.use('ggplot')
Reading Data
os.chdir(r"/home/jovyan/")
df = pd.read_csv(r'util/data/us-states-03-30-20.csv')
df.head()
| | date | state | fips | cases | deaths |
|---|---|---|---|---|---|
| 0 | 2020-01-21 | Washington | 53 | 1 | 0 |
| 1 | 2020-01-22 | Washington | 53 | 1 | 0 |
| 2 | 2020-01-23 | Washington | 53 | 1 | 0 |
| 3 | 2020-01-24 | Illinois | 17 | 1 | 0 |
| 4 | 2020-01-24 | Washington | 53 | 1 | 0 |
df.shape
(1554, 5)
# Aggregate over states so that each date has a single total, then plot both series
totals = df.groupby('date')[['cases', 'deaths']].sum()
ax = totals.plot(kind='line', figsize=(12, 8))
ax.set_ylabel('Count')
ax.set_title('Increase of cases and deaths over time')
plt.show()
# Sum the cases and deaths per state over all dates.
# Note: if the file stores cumulative counts, this sum counts earlier days repeatedly.
latest_sum = df.groupby('state')[['cases', 'deaths']].sum()
# Sort in descending order
latest_sum = latest_sum.sort_values(by=['cases', 'deaths'], ascending=False)
latest_sum.head(10)
| state | cases | deaths |
|---|---|---|
| New York | 387714 | 4991 |
| New Jersey | 73838 | 897 |
| California | 47065 | 921 |
| Washington | 40990 | 2145 |
| Michigan | 31601 | 702 |
| Florida | 28846 | 401 |
| Massachusetts | 27827 | 252 |
| Illinois | 27243 | 342 |
| Louisiana | 23678 | 896 |
| Pennsylvania | 18863 | 194 |
# Plot the result
latest_sum.head(10).plot(kind='bar', figsize=(10, 6))
plt.ylabel('Count')
plt.title('Top 10 states by number of cases and deaths as of 03-30-2020')
plt.show()
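The summed figures above are much larger than the per-date figures reported later for 2020-03-30, which suggests the file stores cumulative counts. Under that assumption, a minimal sketch of ranking states by their latest snapshot rather than by a sum over dates could look like this. The `sample` frame and its numbers are hypothetical and only mimic the column layout of the states file.

```python
import pandas as pd

# Hypothetical toy data in the same layout as the states file.
# Counts are assumed to be cumulative: each row already includes earlier days.
sample = pd.DataFrame({
    'date': ['2020-03-28', '2020-03-29', '2020-03-29', '2020-03-30', '2020-03-30'],
    'state': ['State A', 'State A', 'State B', 'State A', 'State B'],
    'cases': [10, 20, 5, 30, 8],
    'deaths': [0, 1, 0, 2, 1],
})

# ISO-formatted date strings sort chronologically, so max() is the latest date.
latest_date = sample['date'].max()

# Keep only the rows for the latest date and rank the states by cases.
latest_by_state = (
    sample.loc[sample['date'] == latest_date, ['state', 'cases', 'deaths']]
    .set_index('state')
    .sort_values(by=['cases', 'deaths'], ascending=False)
)

# For comparison: summing over dates counts the earlier days again.
summed_by_state = sample.groupby('state')[['cases', 'deaths']].sum()

print(latest_by_state)
print(summed_by_state)
```

With the real data, `sample` would be replaced by the data frame read from the states file.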
Total Number of Cases and Deaths as of 2020-03-30
latest_total = df.groupby('date')[['cases', 'deaths']].sum().reset_index()
latest_total = latest_total[latest_total['date'] == latest_total['date'].max()].reset_index(drop=True)
latest_total
| | date | cases | deaths |
|---|---|---|---|
| 0 | 2020-03-30 | 163796 | 3073 |
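If the per-date totals are cumulative, the day-over-day increase can be derived by differencing consecutive dates. The following is a rough sketch on hypothetical toy data (the `sample` frame is not taken from the file) rather than part of the original analysis.

```python
import pandas as pd

# Hypothetical toy data; counts are assumed to be cumulative per state.
sample = pd.DataFrame({
    'date': ['2020-03-28', '2020-03-29', '2020-03-29', '2020-03-30', '2020-03-30'],
    'state': ['State A', 'State A', 'State B', 'State A', 'State B'],
    'cases': [10, 20, 5, 30, 8],
    'deaths': [0, 1, 0, 2, 1],
})

# Total cumulative counts per date across all states (index sorted by date).
totals = sample.groupby('date')[['cases', 'deaths']].sum()

# Difference between consecutive dates = newly reported counts per day.
# The first date has no predecessor, so its row is dropped.
daily_new = totals.diff().dropna().astype(int)

print(daily_new)
```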
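# Extract the cases and deaths reported on the latest date using loc.
# copy() avoids a SettingWithCopyWarning when the new column is added below.
present_stats = df.loc[df['date'] == '2020-03-30', ['date', 'state', 'cases', 'deaths']].copy()
# Compute the death percentage (deaths per 100 reported cases)
present_stats['death percentage'] = (present_stats['deaths'] / present_stats['cases']) * 100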
# Sort in descending order
present_stats = present_stats.sort_values(by=['cases', 'deaths', 'death percentage'], ascending=False)
present_stats.head(10)
| | date | state | cases | deaths | death percentage |
|---|---|---|---|---|---|
| 1532 | 2020-03-30 | New York | 67174 | 1224 | 1.822134 |
| 1530 | 2020-03-30 | New Jersey | 16636 | 199 | 1.196201 |
| 1503 | 2020-03-30 | California | 7421 | 146 | 1.967390 |
| 1522 | 2020-03-30 | Michigan | 6508 | 197 | 3.027044 |
| 1521 | 2020-03-30 | Massachusetts | 5752 | 61 | 1.060501 |
| 1508 | 2020-03-30 | Florida | 5694 | 71 | 1.246927 |
| 1550 | 2020-03-30 | Washington | 5179 | 221 | 4.267233 |
| 1513 | 2020-03-30 | Illinois | 5070 | 84 | 1.656805 |
| 1539 | 2020-03-30 | Pennsylvania | 4156 | 48 | 1.154957 |
| 1518 | 2020-03-30 | Louisiana | 4025 | 186 | 4.621118 |
# Plot the result
present_stats.head(10).plot(kind='bar', x='state', y='death percentage', figsize=(10, 6))
# Set the plot title
plt.title('Death percentage in the 10 states with the most cases as of 03-30-2020')
plt.show()
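The bar chart above keeps the states in order of case count. To rank the same ten states by death percentage instead, one option is to sort on that column. The sketch below re-enters the cases and deaths from the table above by hand; the names `top10` and `ranked` are introduced here for illustration only.

```python
import pandas as pd

# Cases and deaths on 2020-03-30 for the ten states listed in the table above.
top10 = pd.DataFrame({
    'state': ['New York', 'New Jersey', 'California', 'Michigan', 'Massachusetts',
              'Florida', 'Washington', 'Illinois', 'Pennsylvania', 'Louisiana'],
    'cases': [67174, 16636, 7421, 6508, 5752, 5694, 5179, 5070, 4156, 4025],
    'deaths': [1224, 199, 146, 197, 61, 71, 221, 84, 48, 186],
})

# Death percentage = deaths per 100 reported cases.
top10['death percentage'] = top10['deaths'] / top10['cases'] * 100

# Rank by death percentage rather than by case count.
ranked = top10.sort_values(by='death percentage', ascending=False).reset_index(drop=True)

print(ranked[['state', 'death percentage']].round(2))
```

This ranks only the ten states with the most cases; a ranking over all states would start from the full data for that date.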
# !pip install plotly
# !conda install psutil --yes
import plotly.express as px
# Keys in labels must be column names; 'y' is not a column, so it had no effect
fig = px.bar(df, x='date', y='cases', color='state', labels={'cases': 'Cases'},
             hover_data=['state'],
             title='Evolution of Reported COVID-19 Cases in the United States')
fig.show()
fig = px.bar(df, x='date', y='deaths', color='state', labels={'deaths': 'Deaths'},
             hover_data=['state'],
             title='Evolution of Reported COVID-19 Deaths in the United States')
fig.show()
# Tree Map Visualization of COVID-19 Cases by Date and State
fig = px.treemap(df.sort_values(by='cases', ascending=False).reset_index(drop=True),
path=["state", "date"], values="cases", height=700,
title='Number of COVID-19 Cases by State and Date',
color_discrete_sequence = px.colors.qualitative.Prism)
fig.data[0].textinfo = 'label+text+value'
fig.show()
# Tree Map Visualization of COVID-19 Deaths by State and Date
fig = px.treemap(df.sort_values(by='deaths', ascending=False).reset_index(drop=True),
path=["state", "date"], values="deaths", height=700,
title='Number of deaths from COVID-19 by State and Date',
color_discrete_sequence = px.colors.qualitative.Prism)
fig.data[0].textinfo = 'label+text+value'
fig.show()
df = pd.read_csv('util/data/us-counties-03-30-20.csv')
df.head()
| | date | county | state | fips | cases | deaths |
|---|---|---|---|---|---|---|
| 0 | 2020-01-21 | Snohomish | Washington | 53061.0 | 1 | 0 |
| 1 | 2020-01-22 | Snohomish | Washington | 53061.0 | 1 | 0 |
| 2 | 2020-01-23 | Snohomish | Washington | 53061.0 | 1 | 0 |
| 3 | 2020-01-24 | Cook | Illinois | 17031.0 | 1 | 0 |
| 4 | 2020-01-24 | Snohomish | Washington | 53061.0 | 1 | 0 |
df.shape
(21799, 6)
# Sum the cases and deaths per county over all dates.
# Note: grouping by name alone may merge same-named counties from different states.
latest_sum = df.groupby('county')[['cases', 'deaths']].sum()
# Sort in descending order
latest_sum = latest_sum.sort_values(by=['cases', 'deaths'], ascending=False)
latest_sum.head(10)
| county | cases | deaths |
|---|---|---|
| New York City | 223933 | 4047 |
| Westchester | 57820 | 69 |
| Nassau | 41585 | 228 |
| Suffolk | 34767 | 280 |
| Unknown | 26922 | 276 |
| King | 21128 | 1682 |
| Cook | 20221 | 201 |
| Wayne | 15509 | 306 |
| Bergen | 13194 | 247 |
| Los Angeles | 13173 | 214 |
# Plot the result
latest_sum.head(10).plot(kind='bar', figsize=(10, 6))
plt.ylabel('Count')
plt.title('Top 10 counties by number of cases and deaths as of 03-30-2020')
plt.show()
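County names are not unique across states, so grouping by the `county` column alone can combine unrelated counties. A minimal sketch of grouping by state and county together follows. The rows in `sample` are hypothetical and only illustrate the name collision.

```python
import pandas as pd

# Hypothetical rows for a single date; two different states share a county name.
sample = pd.DataFrame({
    'date': ['2020-03-30', '2020-03-30', '2020-03-30'],
    'county': ['Wayne', 'Wayne', 'Cook'],
    'state': ['Michigan', 'Ohio', 'Illinois'],
    'cases': [100, 3, 50],
    'deaths': [5, 0, 1],
})

# Grouping by name only merges both "Wayne" rows into one entry.
by_name = sample.groupby('county')[['cases', 'deaths']].sum()

# Grouping by (state, county) keeps them apart.
by_state_county = (
    sample.groupby(['state', 'county'])[['cases', 'deaths']]
    .sum()
    .sort_values(by=['cases', 'deaths'], ascending=False)
)

print(by_name)
print(by_state_county)
```

The `fips` column could serve as the grouping key as well, although the `.0` suffix in the output above suggests some rows have no FIPS code, so those rows would need separate handling.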
fig = px.bar(df, x='date', y='cases', color='county', labels={'cases': 'Cases'},
             hover_data=['county'],
             title='Evolution of Reported COVID-19 Cases in United States Counties')
fig.show()
fig = px.bar(df, x='date', y='deaths', color='county', labels={'deaths': 'Deaths'},
             hover_data=['county'],
             title='Evolution of Reported COVID-19 Deaths in United States Counties')
fig.show()
# Tree Map Visualization of COVID-19 Cases by County and Date
fig = px.treemap(df.sort_values(by='cases', ascending=False).reset_index(drop=True),
                 path=["county", "date"], values="cases", height=700,
                 title='Number of COVID-19 Cases by County and Date',
                 color_discrete_sequence=px.colors.qualitative.Prism)
fig.data[0].textinfo = 'label+text+value'
fig.show()
# Tree Map Visualization of COVID-19 Deaths by County and Date
fig = px.treemap(df.sort_values(by='deaths', ascending=False).reset_index(drop=True),
path=["county", "date"], values="deaths", height=700,
title='Number of deaths from COVID-19 by County and Date',
color_discrete_sequence = px.colors.qualitative.Prism)
fig.data[0].textinfo = 'label+text+value'
fig.show()